(54) Computer system and computer-implemented process for classifying records in a computer
database



(57) Data flow for the process of segmentation of a database is managed by an analysis table
created and maintained within the database. Data are processed within the database. Segment
definitions are stored in one or more tables created in the database as a result of the
segmentation process. The analysis table may include a field containing a random number. The
random number may be used to subsample the records in the analysis table in order to limit the
number of records, thus reducing processing time, while maintaining a sample size which is
statistically significant.
Description

The present invention relates to computer systems and computer processes for classifying data in a
database into segments.

A rapidly developing area of technology in the field of computer database systems is the
classification of data into segments according to a target condition. Such a classification may be
used, for example, to make predictions. This classification is sometimes referred to as data
mining. In general, a database classification system identifies segments or groups of entities in
a database according to how well they meet a target condition of interest. Each segment is defined
by a selection criterion, e.g., a rule, which is determined as part of the classification process.
The selection criterion defining a segment may be considered as a search query on fields in the
database which may have a relationship to the target condition of interest. Such fields are called
predictors. Database entities matching a selection criterion are in the segment defined by the
selection criterion.

There are many kinds of systems which classify entities in a database into segments. Some of the
more commonly used classification systems include neural networks, genetic algorithm classifiers,
and classification and regression trees (CART). The use of CART is described in detail in
Classification and Regression Trees, by L. Breiman et al. (Belmont, California: Wadsworth
International Group, 1984). A classification system is also shown in U.S. Patent No. 4,719,571.

There are several implementations of CART, most of which process data by extracting and analyzing
a flat file of records from a database. When the database is relational, extraction of a flat file
eliminates processing required to join numerous tables in the database. Implementations of CART
also typically perform a depth-first search of the predictor space.

One problem with typical implementations of CART is that, in the initial stages of the algorithm,
much more data is processed than is needed to obtain statistically significant results. Although a
technique called subsampling may be used to reduce processing time, the entire flat file is
nonetheless extracted from the database. This extraction involves wasteful and unnecessary copying
of data. Since such classification is often performed on very large databases, e.g., having
terabytes of data, such copying is impractical.

Another problem with typical implementations of CART is that a depth-first search does not provide
an indication of a best classification at any level of the tree until the last branches of the
tree are being analyzed. On the other hand, the depth-first search is performed on increasingly
smaller data sets of the same data, which improves memory management on the computer system. While
a breadth-first search may provide a best classification along any level of a search, such a
search results in memory management problems because generally the entire data set is processed at
each level of the search.

Accordingly, a general aim of this invention is to provide a system and method for classifying
entities in a database into segments while operating on data in the database and without
extracting a flat file.

In the present invention, the flow of data used to classify a database into segments is managed by
an analysis table created and maintained within the database. The analysis table includes an
identifier for each entity of the data set sampled from the database and an indicator of the
segment in which it is contained. The analysis table also indicates the data in the target field.
Using the analysis table, data are processed within the database to generate definitions of the
segments. One or more tables are created in the database as a result of the segmentation process
to identify the segmentation.

In one embodiment of the invention, the analysis table includes a field containing a random
number. The random number may be used to subsample the records in the analysis table in order to
limit the number of records to be processed, thus reducing processing time, while maintaining a
sample size which is statistically significant. For example, all records in a particular node
having a random number less than a given threshold can be selected for the purpose of determining
the best split of the node. By adjusting the threshold applied to the random number according to a
desired number of records and the number of records in the node, the sample size can be
controlled.
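
The threshold adjustment described above can be illustrated with a minimal sketch. It assumes an SQLite database accessed from Python; the table name `analysis`, the column names `key`, `rand` and `segment`, and the helper `subsample_node` are hypothetical names chosen for illustration and are not taken from the specification.

```python
import random
import sqlite3

def subsample_node(conn, node, desired):
    """Return keys of a random subsample of roughly `desired` records in `node`."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM analysis WHERE segment = ?", (node,)
    ).fetchone()
    if count == 0:
        return []
    # The expected sample size is threshold * count, so desired / count
    # aims at `desired` records; the threshold is capped at 1.0.
    threshold = min(1.0, desired / count)
    rows = conn.execute(
        "SELECT key FROM analysis WHERE segment = ? AND rand < ?",
        (node, threshold),
    ).fetchall()
    return [row[0] for row in rows]

# Hypothetical demonstration data: 10,000 records, all in node 1.
rng = random.Random(0)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE analysis (key INTEGER PRIMARY KEY, rand REAL, segment INTEGER)"
)
conn.executemany(
    "INSERT INTO analysis VALUES (?, ?, 1)",
    [(k, rng.random()) for k in range(10000)],
)
sample = subsample_node(conn, node=1, desired=1000)   # about 1,000 keys
everything = subsample_node(conn, node=1, desired=20000)  # threshold capped
```

Because the random number is stored with each record, the same threshold always selects the same records, and only the selected keys leave the database.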

Additionally, the analysis table may enhance memory management during the classification process
by using the segment number stored for each record to limit the number of segments being
classified. Thus, the classification process may divide only selected segments into further
segments.

Another aspect of the invention is a system for classifying a database into segments and for
storing definitions of the segments in one or more tables in the database. The tables may include
statistical information associated with the segmentation. Storing segment definitions in a
database allows for definitions to be accessed in the same manner as other data.

Another aspect of the invention is a classification system which classifies a database into
segments using a sample of entities of the database. When a segmentation has been determined using
this sample, segments having a substantial likelihood of having further significant segmentation
are analyzed using a new sample of database entities in that segment.

The various combinations of one or more aspects of the present invention, and the embodiments
thereof, are also aspects of the present invention. It should be understood that those of ordinary
skill in the art can derive other embodiments of the invention from the following detailed
description of an example embodiment.

EP 0 797 160 A2

The invention will now be described by way of example with reference to the accompanying drawings:

FIG. 1 is a perspective view of a computer system for an embodiment of the invention;

FIG. 2 is a block diagram of a computer system;

FIG. 3 is a schematic diagram of the memory system shown in FIG. 2;

FIG. 4 is a block diagram illustrating a database classification system;

FIG. 5 is an illustration of a segment definition, including a selection criterion and an associated statistical measure;

FIG. 6 is a diagram of an example decision tree such as may be generated using classification and regression trees;

FIG. 7 is a table illustrating an analysis table used in one embodiment of this invention;

FIG. 8 is a flowchart describing segmentation of a database using classification and regression trees;

FIG. 9 is a flowchart describing how a decision tree is developed in FIG. 8;

FIG. 10 is a flowchart describing how data is split in FIG. 9;

FIGS. 11A-11G are tables illustrating definitions and descriptions of segments; and

FIG. 12 is a flowchart describing how segment descriptions are written to the tables of FIGS. 11A-11F.

The present invention will be more completely understood through the following detailed
description which should be read in conjunction with the attached drawing in which similar
reference numbers indicate similar structures.

Referring now to FIG. 1, a computer system 20 includes an output device 24 which displays
information to a user. The computer system includes a main unit 22 connected to the output device
24 and an input device 26, herein shown as a keyboard. As shown in FIG. 2, the main unit 22
generally includes a processor 28 connected to a memory system 30 via an interconnection mechanism
32. The input device 26 is also connected to the processor and memory system via the
interconnection mechanism, as is the output device 24.

It should be understood that one or more output devices may be connected to the computer system.
Example output devices include a cathode ray tube (CRT) display, liquid crystal displays (LCD),
printers, communication devices such as a modem, and audio output. It should also be understood
that one or more input devices 26 may be connected to the computer system. Example input devices
include a keyboard, keypad, track ball, mouse, pen and tablet, communication device, audio input
and scanner. It should be understood the invention is not limited to the particular input or
output devices used in combination with the computer system or to those described herein.

The computer system 20 may be a general purpose computer system which is programmable using a high
level computer programming language, such as "C" or "Pascal". The computer system may also be
specially programmed, special purpose hardware. In a general purpose computer system, the
processor is typically a commercially available processor, of which the series x86 processors,
available from Intel, and the 680X0 series microprocessors available from Motorola are examples.
Many other processors are available. Such a microprocessor executes a program called an operating
system, of which UNIX, DOS and VMS are examples, which controls the execution of other computer
programs and provides scheduling, debugging, input/output control, accounting, compilation,
storage assignment, data management and memory management, and communication control and related
services. The processor and operating system define a computer platform for which application
programs in high-level programming languages are written.

It should be understood the invention is not limited to a particular computer platform, particular
processor, or particular high-level programming language. Additionally, the computer system 20 may
be a multiprocessor computer system or may include multiple computers connected over a computer
network.

An example memory system 30 will now be described in more detail in connection with FIG. 3. A
memory system typically includes a computer readable and writeable nonvolatile recording medium
34, of which a magnetic disk, a flash memory and tape are examples. The disk may be removable,
known as a floppy disk, or permanent, known as a hard drive. A disk, which is shown in FIG. 3, has
a number of tracks, as indicated at 36, in which signals are stored, typically in binary form,
i.e., a form interpreted as a sequence of ones and zeros such as shown at 40. Such signals may
define an application program to be executed by the microprocessor, or information stored on the
disk to be processed by the application program. Typically, in operation, the processor 28 causes
data to be read from the nonvolatile recording medium 34 into an integrated circuit memory element
38, which is typically a volatile, random access memory such as a dynamic random access memory
(DRAM) or static memory (SRAM). The integrated circuit memory element 38 allows for faster access
to the information by the processor than does the disk 34. The processor generally manipulates the
data within the integrated circuit memory 38 and then copies the data to the disk 34 when
processing is completed. A variety of mechanisms are known for managing data movement between the
disk 34 and the integrated circuit memory element 38, and the invention is not limited thereto. It
should also be understood that the invention is not limited to a particular memory system.

FIG. 4 is a block diagram illustrating a system for classifying data in a database into segments,
in one embodiment of the invention. This system includes a database 60 which includes a data
repository 52 for storing the data of the database, which may include a memory system such as
shown at 30 in FIG. 3. The database 60 also includes a database management system 54 which
controls access to the data and handles copying of data between the different memory levels of the
memory system 30.

Such a database management system typically provides access through a standard interface, for
example the Structured Query Language (SQL), such as described in SQL: The Structured Query
Language, by C.J. Hursch et al. (Blue Ridge Summit, Pennsylvania: TAB Books, 1988), which is
hereby incorporated by reference. An example relational database management system is the ORACLE 7
database, having an ODBC interface. Such a database is accessed using SQL queries encapsulated in
an ODBC message sent to the database server by a client program. Although the following
description is given in terms of a particular example, using a relational database on a database
server, accessed using SQL by a client program, it should be understood that the invention is not
limited thereto and that other kinds of databases and other access mechanisms may be used.

The present invention also is useful with database management systems designed for operation on a
machine with parallel processors. Such DBMS's include the DB2 and ORACLE7 database management
systems, available from International Business Machines and Oracle, respectively.

A relational database is organized using tables of records. Each record contains related
information that is organized into fields. Typical examples are customer lists wherein each
customer has one or more records containing fields for address, telephone number, and so on. Some
fields include variables such as total purchases, payments made each month and so on. Each item in
the database that has a unique identifier, such as a customer, is considered an entity in the
database. Each entity may have one or more associated records. The data in a database may be of
many types, including simple and compound data types. Simple data types include numbers, dates and
strings, for example, while compound data types include sets and arrays, for example. Simple data
types are represented as records that associate a unique identifier with a single value, while
compound data types associate a unique identifier with multiple values. When the compound data
type is an array, additional fields are used to index each value. The different types of data may
involve different types of queries and other operations.

It is desirable to apply data warehousing techniques to the database 60 prior to classification by
a database classification system. Such techniques involve ensuring that the data in the database
are complete and correct and have correct formats. A database classification system 56 retrieves
data 56 from the database 60 and classifies the data to determine segment definitions 50,
described below in connection with FIG. 5. Classification and regression trees (CART) are one
technique used to create segment definitions. Classification and regression trees are described in
more detail in Classification and Regression Trees, by L. Breiman et al. (Belmont, California:
Wadsworth International Group, 1984), which is hereby incorporated by reference.

Referring now to FIG. 5, a segment definition 50 includes a representation 62 of a selection
criterion applied to the data in the database. An example of a selection criterion is a rule. A
segment definition also may include an indication 64 of an associated statistical measure, e.g., a
probability, an average measure or an average rate for a segment entity, which indicates how well
entities in a segment meet some target condition. Other values related to a probability, such as
the standard deviation, variance and mean, may also be determined. An indication of the target may
also be provided.

Referring now to FIG. 6, using CART to determine segmentation of a database involves generating a
decision tree 70 representing the segment definitions. The decision tree includes a number of
nodes, numbered here sequentially from 1 to 15 in a breadth-first order. Each node represents a
segment and has an associated selection criterion, e.g., node 8 represents a segment in which each
entity has a>3, b=M and I<$30,000. Using CART, data in a segment are split into two (or more)
segments according to the predictor that provides the best statistical grouping or split of the
data. Nodes which are branches of another node indicate the best split of the other node and have
a segment definition that is narrower than the segment definition of the other node. For example,
node 1 may represent all customers. Node 2 may represent customers which made fewer than three
purchases, whereas node 3 may represent customers which made 3 or more purchases, where the target
condition is a total of purchases over a given amount.
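
The breadth-first numbering of a binary decision tree can be sketched with a few arithmetic helpers. This is an illustration only, under the assumption that the two branches of node k are numbered 2k and 2k+1, which is consistent with a tree of 15 nodes numbered level by level; the function names are hypothetical.

```python
def children(node):
    """Numbers of the two branches of `node` (assumed 2k and 2k+1)."""
    return 2 * node, 2 * node + 1

def parent(node):
    """Number of the node that was split to produce `node`."""
    return node // 2

def level(node):
    """Level of `node` in the tree, counting the root (node 1) as level 1."""
    return node.bit_length()

def max_node_at_level(n):
    """Highest node number that can occur on level n, i.e. (2^n)-1."""
    return 2 ** n - 1

def path_to_root(node):
    """Nodes whose tests together make up the selection criterion of `node`."""
    path = [node]
    while node > 1:
        node = parent(node)
        path.append(node)
    return path
```

Under this assumption node 8 lies on level 4, below nodes 4, 2 and 1, which matches a selection criterion made up of three tests.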

As described above, the database classification system 56 processes data in the relational
database 60 to define a segmentation such as represented by the decision tree in FIG. 6. An
example implementation will now be described in connection with FIGS. 7-12 and the attached
Appendices I-VIII.

In order to avoid extraction of a large flat file of records from a database, in the present
invention data are processed and classified within the database. In order to perform such a
classification in the database, a structure is constructed in the database to maintain an
indication of the sample on which classification is being performed and the segment or node in
which each record in the sample is classified. Such a structure is illustrated in FIG. 7, herein
called an analysis table 100.

The analysis table 100 includes a row for each unique identifier 102 for each entity in the sample
selected from the database for classification. The columns of this table include the unique
identifier 102 and an indication 104 of whether this entity meets a target condition. A random
number 106, e.g., between 0 and 1, is also assigned in one embodiment of the invention. The use of
these random numbers will be described in more detail below. A segment number, as indicated at
108, indicates the present segment or node in the decision tree, such as in FIG. 6, to which the
entity is assigned. Upon completion of classification, each entity in the analysis table is
assigned to a leaf node of the decision tree. The data shown in the analysis table 100 of FIG. 7
are merely exemplary and are not intended to illustrate actual data. Unlike the prior art, the
data in the records for the entities of the sample being analyzed are not extracted from the
database into a flat file. Rather, the analysis table 100 is created and is maintained in the
database.
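
A minimal sketch of such an analysis table follows. It assumes SQLite accessed from Python rather than the database of the example embodiment; the source table `customers`, its columns, the target rule and the SQL function `py_rand` (registered here from Python) are hypothetical and serve only to show the four columns described above.

```python
import random
import sqlite3

rng = random.Random(42)
conn = sqlite3.connect(":memory:")
# Expose a random number generator to SQL under the hypothetical name py_rand.
conn.create_function("py_rand", 0, rng.random)

# Hypothetical source data standing in for the database being classified.
conn.execute(
    "CREATE TABLE customers (key INTEGER PRIMARY KEY, purchases INTEGER, spent REAL)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 2, 40.0), (2, 5, 310.0), (3, 1, 15.0), (4, 7, 520.0)],
)

# Analysis table: identifier, target indication, random number, segment number.
conn.execute(
    """CREATE TABLE analysis (
           key INTEGER PRIMARY KEY,
           target INTEGER NOT NULL,
           rand REAL NOT NULL,
           segment INTEGER NOT NULL)"""
)
# Target is coded 1 (meets the condition) or 2 (does not); every entity
# starts in segment 1. The rows are built inside the database.
conn.execute(
    """INSERT INTO analysis (key, target, rand, segment)
       SELECT key, CASE WHEN spent > 100 THEN 1 ELSE 2 END, py_rand(), 1
       FROM customers"""
)
rows = conn.execute(
    "SELECT key, target, rand, segment FROM analysis ORDER BY key"
).fetchall()
```

Only keys and derived values are written to the analysis table; the predictor fields stay in the source table and are reached by joining on the key.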

An example process using an analysis table 100 (FIG. 7) and classification and regression trees to
generate a segmentation will now be described in connection with FIG. 8. This example is
applicable to a relational database accessed using SQL. A computer system may be programmed to
perform this process using a high level programming language with embedded SQL commands for
accessing a relational database. In this example, the target field in the analysis table is
assumed to be numeric and takes only one of two values, e.g., either 1 or 2. Categorical targets
having only two categories, e.g., male or female, are also converted to a numeric target. Those of
ordinary skill in this art should be able to adapt the following example to continuous targets,
other database types and database access mechanisms given the following detailed description.
Input used to begin classification may include information required for connection to the
database, the name to be used for the analysis table, the name to be used for the segment
descriptor and associated tables described below, a maximum allowed run time, a name of a target
field and allowed predictor fields.

The first step of the process involves verifying capabilities of the database in step 120. This
step involves verifying that the database used supports permissions such as the ability to create,
update and drop tables.

The next step of the process is step 122 of creating an analysis table 100 in the database being
accessed. Example SQL queries for creating an analysis table in step 122 are shown in Appendix
VIII. After creating the analysis table, it may be desirable to verify that the analysis table and
fields to be used for predictors are accessible from the database and that the analysis table has
enough entities for classification. This step may involve, in this example, ensuring that the
target field is dichotomous with values of 1 and 2 and with no missing values. This step also may
involve ensuring that the predictors have enough corresponding values to be useful.
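
The verification of the target field can be sketched as follows, assuming SQLite accessed from Python. The helper name `verify_analysis`, the `min_entities` parameter and the wording of the messages are hypothetical; the specification does not give this check in code form.

```python
import sqlite3

def verify_analysis(conn, min_entities):
    """Return a list of problems found in the analysis table (empty if none)."""
    problems = []
    total, missing = conn.execute(
        "SELECT COUNT(*), SUM(target IS NULL) FROM analysis"
    ).fetchone()
    if total < min_entities:
        problems.append("too few entities")
    if missing:  # SUM yields None on an empty table, which is also falsy
        problems.append("missing target values")
    values = {
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT target FROM analysis WHERE target IS NOT NULL"
        )
    }
    if values != {1, 2}:
        problems.append("target is not dichotomous with values 1 and 2")
    return problems

# Hypothetical data with one missing target value.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE analysis (key INTEGER PRIMARY KEY, target INTEGER, segment INTEGER)"
)
conn.executemany(
    "INSERT INTO analysis VALUES (?, ?, 1)",
    [(1, 1), (2, 2), (3, 1), (4, None)],
)
problems = verify_analysis(conn, min_entities=3)
```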

A node table is also created by the database classification system, but not necessarily as a
database table. The node table may be implemented as an array, indexed by node number, that
indicates, for each node N of a decision tree (such as in FIG. 6), a number n0 of entities that
meet the target and the number n1 of entities that do not meet the target. The test defining the
split of the node is also stored. Thus, an example node table is as follows:

node    n0        n1        test
1       50,000    30,000    a>3
2       25,000    15,000    b<2
3       15,000    25,000    c>12
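
An in-memory node table of this kind can be sketched as below. The class name `NodeEntry`, the use of a dictionary in place of an array, and the helper `target_rate` are assumptions made for illustration; the counts are those of the example node table, with the second row taken to be node 2.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class NodeEntry:
    n0: int = 0                 # entities in the node that meet the target
    n1: int = 0                 # entities in the node that do not
    test: Optional[str] = None  # test defining the split, None for a leaf

# Node table keyed by node number, filled with the example values.
node_table: Dict[int, NodeEntry] = {
    1: NodeEntry(50000, 30000, "a>3"),
    2: NodeEntry(25000, 15000, "b<2"),
    3: NodeEntry(15000, 25000, "c>12"),
}

def target_rate(entry: NodeEntry) -> float:
    """Fraction of the entities in a node that meet the target."""
    return entry.n0 / (entry.n0 + entry.n1)
```

Keeping this structure outside the database is cheap because it holds one small record per node rather than one per entity.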



The next step of the process is step 124 of generating a decision tree using the data set
identified by the analysis table 100. An example process for generating the decision tree is
described in more detail below in connection with FIG. 9. In this process, the random number field
in the analysis table may be used to generate subsamples of the data set being processed, in a
manner described below. Another kind of sampling involves the generation of train, test and
evaluate sets of data, if desired. In particular, each of these sets of data can be generated by
adding a column to the analysis table for indicating the set in which an entity belongs.
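
Adding a column that marks the set of each entity can be sketched as follows, assuming SQLite accessed from Python. The column names `set_rand` and `dataset` and the 60/20/20 proportions are hypothetical choices; a separate random column is used here so that the assignment does not interfere with subsampling on another random number field.

```python
import random
import sqlite3

rng = random.Random(1)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE analysis (key INTEGER PRIMARY KEY, set_rand REAL)")
conn.executemany(
    "INSERT INTO analysis VALUES (?, ?)",
    [(k, rng.random()) for k in range(1000)],
)

# Add the set indicator and fill it inside the database.
conn.execute("ALTER TABLE analysis ADD COLUMN dataset TEXT")
conn.execute(
    """UPDATE analysis SET dataset = CASE
           WHEN set_rand < 0.6 THEN 'train'
           WHEN set_rand < 0.8 THEN 'test'
           ELSE 'evaluate' END"""
)
sizes = dict(
    conn.execute("SELECT dataset, COUNT(*) FROM analysis GROUP BY dataset")
)
```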

After a decision tree is generated it may also be "pruned" as indicated in step 128. Although this
flowchart indicates that pruning is done after the tree has been generated in step 124, it is
possible to prune the tree during the process of generating the tree. Pruning the tree generally
involves identifying, for each node, the range of subtrees, if any, in which there is a leaf. This
identification requires repeated passes over the tree data structure, one pass per subtree. The
first pass computes the diversity within a given node. Each subsequent pass updates the estimates
of total diversity at the leaves subtended by a node, and identifies those nodes whose splits are
to be removed to produce the next subtree. Node diversity may be measured, for example, by a Gini
metric or any other known diversity metric. The node table may be used to perform this step.

Next, in step 130, a subtree with the minimum error rate on a test subset of the data may be
selected as the best tree. This step involves a comparison of the statistics gathered for
different data sets such as found in the node table, and in essence eliminates or confirms the
selections of the splits of the nodes. This selection process uses the range of subtrees for each
node in the tree to accumulate within-class error rates for each subtree for both test and
evaluation subsets. The total error of each subtree is computed and used to select the best tree
(i.e., the one standard error tree).

Given a selected tree, whether as generated in step 124 or as pruned or selected as in steps 128
or 130, the description of the segments may be written in one or more tables in the relational
database in step 132. This step is described in more detail in connection with FIGS. 11A-11G and
FIG. 12, and involves, for each leaf in the selected tree, composing and writing the description
of each leaf node in the decision tree into one or more tables such as described below in
connection with FIGS. 11A-11G.

The process of generating a decision tree, step 124 in FIG. 8, will now be described in connection
with FIG. 9. This iterative process finds a best split of a segment according to the given
predictors and updates the analysis table 100 and node table described above according to the best
split. The first step of this process is initialization of the analysis table 100 in step 140. In
particular, all of the records in the analysis table 100 are assumed to be in an initial segment,
e.g., segment 1.

A segment number or a range of segment numbers is then selected in step 141, as indicated by
minimum leaf node and maximum leaf node numbers. Initially, the minimum leaf node number is 1 and
the maximum leaf node indicates the highest segment number of the current level of the tree, which
will be less than or equal to (2^n)-1 where n is the number of the current level of the tree.
These values are used because the generation of the decision tree terminates when there are no
more updatable nodes. Example I includes an example SQL instruction for selecting the node numbers
to be evaluated.

EXAMPLE I

SELECT DISTINCT segment
FROM analysis
WHERE segment >= minleafnode
ORDER BY segment

The next step is gathering statistics on the current leaves of the decision tree, in step 142. The
step involves creating a column in the database that counts how many entities there are in each
segment for every target. An example SQL query is below in Example II:
EXAMPLE II

CREATE TABLE targsum (segment INT, n0 INT, n1 INT) PCTFREE 0;
INSERT INTO targsum (segment, n0, n1)
SELECT segment, SUM(2-target), SUM(target-1)
FROM ANALYSIS s
GROUP BY segment;



The contents of this table are used to update the node table described above.

A current node number is then set in step 143. The current node number is initially one and
generally is incremented by one on each iteration. The current node number may also be selected
from a queue such as may be generated by the SQL statement of Example I.
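
The counting statement of Example II can be tried on a small data set with the following sketch, which assumes SQLite accessed from Python and therefore omits the PCTFREE clause. With the target coded as 1 or 2, the expression 2-target contributes 1 only for target 1 and target-1 contributes 1 only for target 2, so the two sums are per-segment counts of each target value. The sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE analysis (key INTEGER PRIMARY KEY, target INTEGER, segment INTEGER)"
)
# Hypothetical entities: (key, target, segment)
conn.executemany(
    "INSERT INTO analysis VALUES (?, ?, ?)",
    [(1, 1, 2), (2, 2, 2), (3, 1, 2), (4, 2, 3), (5, 2, 3), (6, 1, 3), (7, 2, 3)],
)

conn.execute("CREATE TABLE targsum (segment INT, n0 INT, n1 INT)")
conn.execute(
    """INSERT INTO targsum (segment, n0, n1)
       SELECT segment, SUM(2 - target), SUM(target - 1)
       FROM analysis
       GROUP BY segment"""
)
counts = {
    seg: (n0, n1)
    for seg, n0, n1 in conn.execute("SELECT segment, n0, n1 FROM targsum")
}
```

For these rows, segment 2 holds two entities with target 1 and one with target 2, and segment 3 holds one and three respectively.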

The data in the current node are then split into two separate segments in step 144. Splitting the
data of a node involves computing the counts of each value of the target by node, looping over the
available predictors and performing either a categorical or ordered split based on the predictor's
type. The predictor providing the best split of the node is determined by the ability of the
predictor to reduce the diversity of each node. In this example, node diversity is measured by the
Gini metric. There are many other diversity metrics which may be used to split nodes. The step of
finding the best split of each node is described in more detail below in connection with FIG. 10.
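
The reduction in diversity produced by a candidate split can be sketched as follows for a two-valued target. The sketch uses the common two-class form of the Gini index, 2p(1-p); the specification names the Gini metric but does not give a formula, so the exact form and the function names are assumptions.

```python
def gini(n0, n1):
    """Two-class Gini diversity of a node holding n0 and n1 entities."""
    total = n0 + n1
    if total == 0:
        return 0.0
    p = n0 / total
    return 2.0 * p * (1.0 - p)

def diversity_reduction(left, right):
    """Reduction in diversity when a node is split into `left` and `right`.

    Each side is an (n0, n1) pair of target counts; the node being split
    is the sum of the two sides.
    """
    l0, l1 = left
    r0, r1 = right
    n_left, n_right = l0 + l1, r0 + r1
    n = n_left + n_right
    if n == 0:
        return 0.0
    before = gini(l0 + r0, l1 + r1)
    after = (n_left / n) * gini(l0, l1) + (n_right / n) * gini(r0, r1)
    return before - after
```

For example, splitting a node with counts (50, 50) into sides (40, 10) and (10, 40) lowers the diversity from 0.5 to 0.32, while a split into two sides of (25, 25) gives no reduction.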

In one embodiment of the invention, either a categorical dichotomous split or ordered dichotomous
split is performed based on the type of the predictor. For a categorical dichotomous split, the
cumulative counts of each value of the target are computed for each unique count of class 1 of
each unique value of the predictor, organized by node. Each of these values is combined with the
counts of each value of the target by node to compute the reduction of diversity of each node,
using that value for a split. The predictor that maximizes the reduction in diversity of a node is
selected as the best split of the node. The set of values of the predictor that defines the left
side of each split is then obtained. The test defining the left side of this split is stored in
the node table described above.

For a dichotomous ordered split, the cumulative counts of each value of the target for each
predictor value are computed, and organized by node. Each of these values is combined with the
counts of each value of the target by node to compute the reduction in diversity of each node,
using that value for a split. The predictor which maximizes the reduction in diversity of a node
is selected as the best split of the node. The value of the predictor that defines the left side
of the split is obtained. The test defining the left side of this split is stored in the node
table described above.

If the current node number is the maximum leaf node number, or the last node in a queue, as
determined in step 145, the analysis table 100 is updated in step 146. Otherwise the next node is
selected and its best split is found by repeating steps 143 through 145.
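
The search for an ordered dichotomous split from cumulative counts can be sketched as follows. The input is a list of (predictor value, n0, n1) rows for one node, such as could be read from a per-predictor summary table; the function names, the two-class Gini form 2p(1-p) and the convention that the left side is "predictor < threshold" are assumptions made for illustration.

```python
def gini(n0, n1):
    """Two-class Gini diversity, 2p(1-p)."""
    total = n0 + n1
    if total == 0:
        return 0.0
    p = n0 / total
    return 2.0 * p * (1.0 - p)

def best_ordered_split(rows):
    """Return (reduction, threshold) of the best split of one node.

    rows: (value, n0, n1) per distinct predictor value in the node.
    The left side of the split holds entities with predictor < threshold.
    Returns (0.0, None) when no split reduces the diversity.
    """
    rows = sorted(rows)
    tot0 = sum(r[1] for r in rows)
    tot1 = sum(r[2] for r in rows)
    n = tot0 + tot1
    before = gini(tot0, tot1)
    best = (0.0, None)
    cum0 = cum1 = 0
    for i in range(len(rows) - 1):
        # Cumulative counts of each target value up to this predictor value.
        cum0 += rows[i][1]
        cum1 += rows[i][2]
        n_left = cum0 + cum1
        after = (n_left / n) * gini(cum0, cum1) + (
            (n - n_left) / n
        ) * gini(tot0 - cum0, tot1 - cum1)
        reduction = before - after
        if reduction > best[0]:
            best = (reduction, rows[i + 1][0])
    return best

# Hypothetical counts for a predictor with four distinct values.
reduction, threshold = best_ordered_split(
    [(1, 8, 2), (2, 7, 3), (3, 2, 8), (4, 1, 9)]
)
```

A single pass over the sorted values suffices because each candidate split is evaluated from the running totals alone. Repeating the search for every allowed predictor and keeping the largest reduction gives the best split of the node.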
In order to update the analysis table 100 in step 146, for each node that is split and for which
an effective split was found, an appropriate test is performed. For example, either a set test for
membership in a left side split for the node (for categorical predictors) or a less than test for
membership in the left side of a split for the node (for ordered predictors) may be performed on
each entity defined in the analysis table. When the analysis table is updated, if no good split is
found on a node then the entities in that node are not updated since there is no split. A set of
example SQL statements for updating the analysis table is below in Example III:

EXAMPLE III

To Update Analysis Table:

Update Analysis set segment = 2 * segment
Where segment >= minleafnode

For the First Split:

Update Analysis set segment = 1 + segment
Where segment >= 1 and
Key in (select key from database where test_node_1)

For other levels:

(for node 2 and 3 split)
Update analysis set segment = 1 + segment
Where key in (select key from database where segment = 4 and test_node_3)
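
The two-stage update of Example III can be tried on a small data set with the following sketch, which assumes SQLite accessed from Python. The source table `customers`, the predictor `a` and the node test `a > 3` are hypothetical stand-ins for the database and for test_node_1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (key INTEGER PRIMARY KEY, a INTEGER)")
conn.execute("CREATE TABLE analysis (key INTEGER PRIMARY KEY, segment INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(1, 1), (2, 5), (3, 2), (4, 9)]
)
# All entities start in segment 1.
conn.executemany(
    "INSERT INTO analysis VALUES (?, 1)", [(1,), (2,), (3,), (4,)]
)

min_leaf = 1
# Stage 1: move every entity of the nodes being split to the doubled number.
conn.execute(
    "UPDATE analysis SET segment = 2 * segment WHERE segment >= ?", (min_leaf,)
)
# Stage 2: entities of former node 1 that satisfy its test move one further.
conn.execute(
    """UPDATE analysis SET segment = segment + 1
       WHERE segment = 2
         AND key IN (SELECT key FROM customers WHERE a > 3)"""
)
segments = dict(conn.execute("SELECT key, segment FROM analysis ORDER BY key"))
```

After both statements the entities with a > 3 are in segment 3 and the others in segment 2, and the predictor values have been read only inside the database.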

After the analysis table is updated, new minimum and maximum leaf node numbers are selected in
step 147 for the next level of the tree to be processed. For example, the minimum leaf node number
is doubled and the maximum leaf node number greater than the minimum leaf node number is obtained
from the analysis table. The SQL statement in Example I can be used to create a queue of nodes to
be split. Since the node number associated with a segment which was not split stays the same and
drops below the minimum leaf node number, such a node is not processed in further iterations. A
termination condition is then evaluated in step 148. If the termination condition is not met, a
next level of the tree is processed by repeating steps 142 through 148. The termination condition
may be met by running out of nodes to split or by running out of time. For example, no more nodes
are left to split when the maximum segment number obtained is less than the new minimum leaf node
number after the analysis table has been updated. A suitable SQL statement to obtain the maximum
leaf number is:

select max(segment) as maxsegment from analysis, which is then compared to the minimum leaf number.

Alternatively, a predetermined maximum run time may be a termination condition. For example, the run time of a 
previous pass-fhrough the split node step - 1 44 may be used as. an estimate of the run time for a-next pass. If a next 
pass exceeds the allowed run time, the terrriinatidn condition is met. Upon termination, each item- in the analysis table 
IS classified in one segment which is a leafof the decision tree. It is possible to gather statistics on the leaf nodes as 
with othernodes in step 1 42. * - ' ' ' . 
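
The two termination conditions described above can be combined in a single check. This is a minimal sketch; the function name and its parameters are hypothetical, and using the previous pass's run time directly as the estimate for the next pass follows the example in the text.

```python
def should_terminate(max_segment, new_min_leaf, last_pass_seconds, seconds_left):
    """Return True when no leaves remain to split or the time budget is spent.

    max_segment       -- result of: select max(segment) from analysis
    new_min_leaf      -- doubled minimum leaf node number for the next level
    last_pass_seconds -- run time of the previous pass, used as the estimate
    seconds_left      -- remaining allowed run time
    """
    out_of_nodes = max_segment < new_min_leaf
    out_of_time = last_pass_seconds > seconds_left
    return out_of_nodes or out_of_time
```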

The step of finding the best split of a node will now be described in connection with FIG. 10. Example SQL
statements for each step are found in the attached Appendices I to VII.

First, in step 152, a column is created in the database for counting every value of the target. The target is then
counted per unique predictor by node by summing the values of the target field for the training set in step 154. The
example SQL statement below in Example IV performs steps 152 and 154.

EXAMPLE IV

    CREATE TABLE predsum (segment INT, predictor PTYPE, n0 INT, n1 INT) PCTFREE 0;
    INSERT INTO predsum (segment, predictor, n0, n1)
        SELECT s.segment, d.PREDICTOR, SUM(2-s.target), SUM(s.target-1)
        FROM DATABASE d, ANALYSIS s
        WHERE d.key = s.key
        GROUP BY s.segment, d.PREDICTOR;

This statement is useful for simple data types. Example SQL statements for a variety of data types are found in
Appendix VII.
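
The counting query of Example IV can be tried on a few rows. The sketch below is illustrative only: it runs the query through Python's built-in sqlite3 module, the table name `dbmain` is a stand-in for the source database, and it assumes the target is coded 1 or 2, which is what the expressions `2 - target` and `target - 1` suggest (each sum then counts one of the two target values).

```python
import sqlite3


def predictor_summary(rows):
    """rows: (key, segment, target, predictor) tuples, target coded 1 or 2 (assumed).

    Returns (segment, predictor, n0, n1) per segment and predictor value,
    where n0 counts target == 1 and n1 counts target == 2.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dbmain (key INTEGER, predictor INTEGER)")
    conn.execute("CREATE TABLE analysis (key INTEGER, segment INTEGER, target INTEGER)")
    for key, segment, target, predictor in rows:
        conn.execute("INSERT INTO dbmain VALUES (?, ?)", (key, predictor))
        conn.execute("INSERT INTO analysis VALUES (?, ?, ?)", (key, segment, target))
    return conn.execute(
        "SELECT s.segment, d.predictor, SUM(2 - s.target), SUM(s.target - 1) "
        "FROM dbmain d, analysis s WHERE d.key = s.key "
        "GROUP BY s.segment, d.predictor ORDER BY s.segment, d.predictor"
    ).fetchall()
```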

The analysis table 100 and the statements shown in the Appendices provide a mechanism for ordering the data
to be analyzed. In particular, the data are grouped by segment, and for each segment the data are grouped by predictor.
Thus, identifiers and their corresponding target values are provided in sequence by segment for a given predictor.
Thus, for one predictor, its Gini metric is generated for each segment. The next predictor is then used to generate Gini
metrics for it for each segment. A running total and best predictor is kept for each segment until all predictors have
been analyzed.

Given the set of data to be analyzed for a segment, the target is then cumulated per unique predictor value by
segment in step 156. This step may be implemented, for example, by a series of SQL statements as shown in Appendices
I-VI. SQL is not particularly efficient in the processing of the underlined step in Appendices I and II. This step
is expanded into several, more efficient steps in the SQL of Appendices III-VI.

The Gini decrease per predictor is then found for each segment in step 158. This step may be implemented, for
example, by a series of SQL statements as shown in Appendices I-VI. The maximum Gini for each predictor is then
selected in step 160, also as shown in Appendices I-VI. The best predictor is then selected from this information. Finally,
tables used in the computation process are dropped from the database in step 162, for example using the commands
at the end of each of Appendices I-VI. Steps 156 through 160 may also be implemented using a computer program
written in another programming language, e.g., "C," which interacts with a computer program that accesses the database
to perform steps 152-154.
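
Steps 156 through 160 can also be sketched outside SQL, as the text suggests. The following Python sketch is an illustration rather than the patented code: it evaluates the same expression used in the Appendices, 2(n0·n11 − n01·n1)² / ((n0 + n1)²(n0 − n01 + n1 − n11)(n01 + n11)), on running totals for one segment. The function names are hypothetical; n0 and n1 are the segment's totals for the two target values, and n01 and n11 are the cumulative counts on the left side of a candidate split.

```python
def gini_decrease(n0, n1, n01, n11):
    """Gini decrease of a split that puts n01 + n11 entities on the left."""
    left = n01 + n11
    right = (n0 - n01) + (n1 - n11)
    if left == 0 or right == 0:
        return 0.0  # an empty side is not a split
    cross = n0 * n11 - n01 * n1
    return 2.0 * cross * cross / ((n0 + n1) ** 2 * left * right)


def best_split(summary):
    """summary: (predictor_value, n0, n1) rows for one segment, sorted by value.

    Cumulates the counts (step 156), scores each cut point (step 158) and
    keeps the maximum (step 160).
    """
    n0 = sum(row[1] for row in summary)
    n1 = sum(row[2] for row in summary)
    best_value, best_gini = None, 0.0
    n01 = n11 = 0
    for value, c0, c1 in summary:
        n01 += c0
        n11 += c1
        g = gini_decrease(n0, n1, n01, n11)
        if g > best_gini:
            best_value, best_gini = value, g
    return best_value, best_gini
```

A single ordered pass with running totals is the same idea that Appendices III-VI use in place of the self-join of Appendices I and II.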

In the foregoing example, each entity of each segment in the analysis table 100 is analyzed when the best split of
a node is determined. In most cases, however, it is neither necessary nor practical to analyze every entity in the analysis
table for all segments being analyzed. In particular, a typical classification may be performed on a large number of
entities sampled from a large database. For example, a large database of data for about forty million entities may be
sampled to obtain a set of one million entities for classification. Such a sample may be represented by the analysis
table 100. In order to obtain a statistically relevant analysis of the predictors and their relationship to a target condition
on these entities, however, it is not necessary to analyze all one million of them. The data set may be subsampled
randomly so as to keep the number of entities small but still statistically significant. For example, node 1 may be split
into nodes 2 and 3 using, for example, only ten thousand entities. Such subsampling is provided for in the present
invention by an efficient use of the random number field 106 in the analysis table 100.

Subsampling reduces the size of the data set for a segment being analyzed by selecting only a certain percentage
of the entities in a segment. For example, selecting only those entities having a random number value of less than
0.1, given a random number on a scale of 0 to 1, will allow for a selection of ten percent of the entities. It has been
found that a sample size of roughly ten thousand entities per segment is capable of providing enough information for
splitting the data of a segment. By knowing the number of entities within a given segment, the ratio of the number of
desired entities to be analyzed to the number of entities in the segment provides the threshold value for the random
number used in selecting entities for analysis of a best split of a node. Such subsampling greatly reduces the amount
of time used to perform the classification and regression tree analysis, while maintaining statistical significance of the
analysis, and can be readily implemented by modifying the "WHERE" clause of an SQL statement used to select
entities from the analysis table 100, such that their random number is less than a given threshold.
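
The threshold described above is simply the desired sample size divided by the segment size, capped at 1. The following is a sketch under stated assumptions: the function names are hypothetical, the column name `random` stands in for the random number field 106, and the default of ten thousand entities is the figure quoted in the text.

```python
def subsample_threshold(segment_size, desired=10_000):
    """Fraction of a segment to keep so that about `desired` entities remain."""
    if segment_size <= 0:
        return 0.0
    return min(1.0, desired / segment_size)


def subsample_where_clause(segment, segment_size, desired=10_000):
    """Build the WHERE clause that restricts a query to the subsample."""
    threshold = subsample_threshold(segment_size, desired)
    return f"WHERE segment = {int(segment)} AND random < {threshold:.6f}"
```

For a segment of one million entities the threshold is 0.01, so roughly ten thousand rows pass the filter; a segment smaller than the desired sample is used in full.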

Another kind of sampling possible with the analysis table is the selection of only certain segments or nodes to be
split, using the segment numbers to be updated. This kind of sampling permits memory to be managed efficiently by
reducing repeated accesses to main storage, e.g., disk, by processing data as much as possible in faster memory in
the memory system of the computer. For example, if the integrated circuit memory 38 (FIG. 3) only holds records for
about thirty thousand entities, then only three nodes should be analyzed at a time. The minimum and maximum node
numbers being analyzed, in combination with the first kind of subsampling, provide this capability.
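
The node-range selection can be illustrated with a few lines of Python. This is a hypothetical sketch, not part of the described system: it assumes each node is subsampled to about ten thousand entities and that `memory_capacity` is the number of entity records that fit in fast memory.

```python
def node_batches(min_node, max_node, memory_capacity, per_node_sample=10_000):
    """Split the node range [min_node, max_node] into (low, high) ranges whose
    subsampled records fit in memory together."""
    batch = max(1, memory_capacity // per_node_sample)
    return [
        (low, min(low + batch - 1, max_node))
        for low in range(min_node, max_node + 1, batch)
    ]
```

With room for thirty thousand records, nodes 4 through 7 are processed as the ranges (4, 6) and (7, 7), matching the three-nodes-at-a-time example above.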

Another kind of sampling, which is particularly applicable to implementations of CART that operate on a sample
extracted from a database and stored in memory, involves generating a decision tree until its nodes cannot be split
into further nodes. Each leaf of this tree is then analyzed after the tree is pruned. For leaves which could not be split
and for which there is still a large number of samples, e.g., one thousand, nothing further is done. Such nodes are not
likely to have any further significant segmentation. For leaf nodes having a small number of samples, however, a new
sample of database entities in that node is obtained from the database. Such nodes are likely to have a further significant
segmentation. For implementations which extract a sample of data from the database, typically for about 10,000 entities,
a new sample of 10,000 entities is extracted. This process, however, involves applying the segment definition of the
node to be analyzed to the database. Applying such segment definitions, however, may involve SQL statements
which could join more than two tables. An efficient process to avoid joining multiple tables when applying two or more
segment definitions is described in commonly-owned European patent application entitled "Computer System and
Computer-Implemented Process for Applying Database Segment Definitions to a Database," filed on even date herewith.

The purpose of a classification process such as CART is not merely to determine a segmentation of one sample
of data according to a target condition. Rather, the segment definitions defined through the classification process may
be applied to other data to make predictions and to assist in decision making. Accordingly, upon completion of the
classification process, the definitions of the segments and some statistical information are generated and stored to
permit their application to other data.

Example tables for storing segment definitions in the relational database are shown in FIGS. 11A-11G. There are
many other ways to store segment definitions, and the invention is not limited to this example. One table is the segment
test table 110 in FIG. 11A. This table includes a column for a segment identifier (SID) 110a for each leaf node in the
decision tree and a column for a test identifier (TID) 110b associated with the segment. A second table is a test predictor
table 112 in FIG. 11B. This table includes, for each TID, an indication 112a of the table in which the field is located, an
indicator 112b of the predictor field, the operation 112c performed on the value in the field and the type 112d of the
data (whether categorical or ordered). Example operations are <, >, IN, INCLUDES, EXCLUDES. An indication 112e
of the importance of the test is also provided. The importance of a test may be indicated by, for example, the Gini metric
associated with the test. A test value table 114 in FIG. 11C is also used which includes, for each TID, a list 114 of
values used for each of the operations in the test predictor table 112. As one example, assume that a segment N was
defined by the rules A>1 and A<3, applied to a variable A, a number, in table T. The segment test table 110 would
include a TID, e.g., X, for the segment N. The test predictor table 112 would include, in two rows for X, an indication
of the table T, the predictor A, the operations < and > and the type "ordered." The test value table would include, for
the TID X, a row for each of "1" and "3".
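
The example of segment N can be written out as data to show how the three tables combine into a selection criterion. The sketch below is illustrative only: the tuple layouts are simplified stand-ins for the tables of FIGS. 11A-11C, and pairing each operation with its value by row order is an assumption, since the text does not say how a value row is matched to an operation row.

```python
# (SID, TID): segment test table
segment_test = [("N", "X")]
# (TID, table, predictor, operation, type): test predictor table
test_predictor = [
    ("X", "T", "A", ">", "ordered"),
    ("X", "T", "A", "<", "ordered"),
]
# (TID, value): test value table
test_value = [("X", "1"), ("X", "3")]


def where_clause(sid):
    """Assemble the selection criterion for one segment from the three tables."""
    clauses = []
    for seg, tid in segment_test:
        if seg != sid:
            continue
        predictors = [row for row in test_predictor if row[0] == tid]
        values = [value for t, value in test_value if t == tid]
        # Assumption: the i-th value row belongs to the i-th operation row.
        for (_, table, pred, op, _type), value in zip(predictors, values):
            clauses.append(f"{table}.{pred} {op} {value}")
    return " AND ".join(clauses)
```

Here `where_clause("N")` yields the condition `T.A > 1 AND T.A < 3`, which is the search query that defines the segment.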

For compound data types an additional test index table 119, as shown in FIG. 11G, may also be provided. The
test index table includes at least one row for each TID corresponding to a compound data type. Another column 119a
indicates the table containing the predictor indicated at 119b. The type 119c of the predictor and a value 119d for the
test are also provided. For example, assume a test uses revenue of a month as a predictor where this revenue is
stored as an array indexed by month. The test predictor table 112 could include the following row:



    TID   Tab           Pred      Op   Typ       Imp
    1     Performance   Revenue   <    numeric   0.8

and the test index table could include the following row:

    TID   Tab           Pred    Typ      Value
    1     Performance   Month   String   January



Additional test predictor, test value and test index tables may also be generated for each TID for other tests called
surrogates. A surrogate is a test on another predictor that produces results which are statistically similar to the results
obtained by the actual test represented by the TID. In such a case, these tables have the same format as the tables
for the actual test, but the importance value 112e represents the agreement between the surrogate and the test. This
agreement may be represented, for example, by a ratio of the number of entities which are classified in the same
segment by the different tests to the total number of entities to be classified by either test.
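
The agreement ratio for a surrogate is straightforward to compute once both tests have been applied to the same entities. A minimal sketch, with a hypothetical function name and with each argument holding the segment assigned to each entity by one of the tests:

```python
def agreement(primary_segments, surrogate_segments):
    """Fraction of entities that both tests place in the same segment."""
    if len(primary_segments) != len(surrogate_segments):
        raise ValueError("both tests must classify the same entities")
    if not primary_segments:
        return 0.0
    same = sum(1 for p, s in zip(primary_segments, surrogate_segments) if p == s)
    return same / len(primary_segments)
```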

Statistical information about the segments also may be generated and stored using tables 116, 118 and 119 in
FIGS. 11D-11F. Each segment has a corresponding row in a table in one of these forms, where the form of the descriptor
depends on the type of the target. Where the target is a categorical one, a table 116 as shown in FIG. 11D includes a
row for each SID. The table includes a column for the number 116a of entities in each segment, and a column 116b,
116c, etc. for each of a set of probabilities P which indicate the probability that an entity in each segment meets the target.
Probability PS is the sum of the probabilities P1, P2, ... Pn. Where the target is a continuous value, a table 118 as shown
in FIG. 11E includes a row for each SID, and a column for the number 118a of entities N in each segment. The table
also includes a column for the probability PS that an entity meets the target, a column for the sum SX over all entities
of the target and a column for the sum of squares SSX of the target value over all entities. Where the target is a rate,
a table 119 as shown in FIG. 11F includes a row for each SID, and a column for the number of entities N in each
segment. The table also includes a column for the probability PS that an entity meets the target, and a column for the
average rate of all entities in each segment. A column for the deviance may also be used.
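
For a continuous target, the stored columns N, SX and SSX are sufficient to recover the mean and variance of the target in a segment without revisiting the entities. The text does not state that they are used this way; the sketch below only shows the standard identity, with a hypothetical function name.

```python
def mean_and_variance(n, sx, ssx):
    """Mean and (population) variance from a count, a sum and a sum of squares."""
    mean = sx / n
    variance = ssx / n - mean * mean
    return mean, variance
```

For target values 1, 2 and 3 the stored row would be N = 3, SX = 6, SSX = 14, giving a mean of 2 and a variance of 2/3.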

Referring now to FIG. 12, the writing of such tables will now be described. This process is used to generate the
tables of FIGS. 11A-11G and can be adapted by those of skill in this art to generate other kinds of tables.

For each leaf in the selected tree, the following steps are performed. First, in step 170, a buffer is initialized for the
segment entry of the leaf. Next, a segment number and predicted probabilities are composed in step 172. The predicted
probabilities are based on the training subset. Actual probabilities may also be composed for an evaluation set if one
is used. Test identifiers are then generated for each node in the decision tree, using the splits defined for each node
to determine the table, predictor, operation, type and value used by the test. These may be obtained from the node
table described above. The test identifiers are collected for each ancestor of each leaf. The list of test identifiers for
each leaf is stored in the segment test table in FIG. 11A.

The test identifiers for each leaf are then sorted in step 176 by predictor to keep only the narrowest condition on
each predictor. The tables in FIGS. 11A-11G are then created using the information collected in steps 172 through 176.
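
Keeping only the narrowest condition on each predictor amounts to retaining the smallest upper bound and the largest lower bound collected along the path from the root to the leaf. A minimal sketch for ordered predictors, with a hypothetical function name and a simplified (predictor, operation, value) representation of a test:

```python
def narrowest(tests):
    """tests: (predictor, op, value) tuples with op '<' or '>'.

    Returns one test per predictor and operation, keeping the tightest bound.
    """
    best = {}
    for predictor, op, value in tests:
        key = (predictor, op)
        if key not in best:
            best[key] = value
        elif op == "<":
            best[key] = min(best[key], value)  # smaller upper bound is narrower
        else:
            best[key] = max(best[key], value)  # larger lower bound is narrower
    return sorted((p, o, v) for (p, o), v in best.items())
```

For example, the ancestor tests A < 10, A < 5, A > 1 and A > 2 reduce to A < 5 and A > 2.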

The segment descriptions found in FIGS. 11A-11G may be applied to new data and used to visualize segmentation
of the database. Application of such segment descriptions to new data is found in commonly owned European Patent
Application, filed on even date herewith, entitled "Computer System and Computer-Implemented Process for Applying
Database Segment Definitions to a Database," which is hereby incorporated by reference. An example of a tool for
visualization of segmentation of a database is found in commonly owned European Patent Application filed on even
date herewith, entitled "Graphical User Interface and Display Process for Data Segments in a Computer Database,"
which is hereby incorporated by reference.

The present invention provides the capability of using CART on data residing in a relational database without
requiring extraction of a flat file from the database. An analysis table in the database, which represents the data set
used by CART and which contains a random number field for each entity, allows data flow to be controlled so as to
maximize processing efficiency, address memory management issues and control building of a decision tree. This
data flow control allows a breadth-first search of the predictor space to be performed, which in turn allows a best
classification to be provided at any time during the classification process.

Having now described a few embodiments of the invention, it should be apparent to those skilled in the art that
the foregoing is merely illustrative and not limiting, having been presented by way of example only. Numerous modifications
and other embodiments are within the scope of one of ordinary skill in the art and are contemplated as falling
within the scope of the invention as defined by the appended claims and equivalents thereto.

Each feature disclosed in the description, and (where appropriate) the claims and drawings may be provided
independently or in any appropriate combination.

APPENDIX I - FOR ORDERED PREDICTORS

/* Create a count of the examples of every value of the target per PREDICTOR. */
CREATE TABLE predsum (segment INT, predictor PTYPE, n0 INT, n1 INT) PCTFREE 0;
INSERT INTO predsum (segment, predictor, n0, n1)
    SELECT s.segment, d.PREDICTOR, SUM(2-s.target), SUM(s.target-1)
    FROM DATABASE d, ANALYSIS s
    WHERE d.key = s.key
    GROUP BY s.segment, d.PREDICTOR;

/* Create an index on this table to speed up the self-join. */
CREATE INDEX predsumidx ON predsum (segment, predictor);

/* Create a cumulative count of the examples of every value of the target per x. */
CREATE TABLE predcum (segment INT, predictor PTYPE, n01 INT, n11 INT) PCTFREE 0;
INSERT INTO predcum (segment, predictor, n01, n11)
    SELECT a.segment, a.predictor, SUM(b.n0), SUM(b.n1)
    FROM predsum a, predsum b
    WHERE a.segment = b.segment AND a.predictor >= b.predictor
    GROUP BY a.segment, a.predictor;

/* Compute the Gini decrease per predictor. */
CREATE TABLE predgini (segment INT, predictor PTYPE, gini FLOAT) PCTFREE 0;
INSERT INTO predgini (segment, predictor, gini)
    SELECT a.segment, a.predictor,
        2 * (-(n01 * n1) + n0 * n11) * (-(n01 * n1) + n0 * n11) /
        ((n0 + n1) * (n0 + n1) * (n0 - n01 + n1 - n11) * (n01 + n11))
    FROM predcum a, targsum b
    WHERE a.segment = b.segment AND (n0 - n01 + n1 - n11) != 0;

/* Find the value of predictor with the maximum decrease. */
SELECT *
FROM predgini a
WHERE gini IN (SELECT MAX(gini)
               FROM predgini b
               WHERE a.segment = b.segment)
ORDER BY segment;

/* Drop the tables in the reverse order they were created. */
DROP TABLE predgini;
DROP TABLE predcum;
DROP INDEX predsumidx;
DROP TABLE predsum;




COMMIT;

APPENDIX II - FOR CATEGORICAL PREDICTORS

/* Create a count of the examples of every value of the target per PREDICTOR. */
CREATE TABLE predsum (segment INT, predictor PTYPE, n0 INT, n1 INT) PCTFREE 0;
INSERT INTO predsum (segment, predictor, n0, n1)
    SELECT s.segment, d.PREDICTOR, SUM(2-s.target), SUM(s.target-1)
    FROM DATABASE d, ANALYSIS s
    WHERE d.key = s.key
    GROUP BY s.segment, d.PREDICTOR;

/* This should be useful to reduce those high-cardinality categoricals. */
CREATE TABLE predsum1 (segment INT, pn1 INT, n0 INT, n1 INT) PCTFREE 0;
INSERT INTO predsum1 (segment, pn1, n0, n1)
    SELECT segment, n1, SUM(n0), SUM(n1)
    FROM predsum
    GROUP BY segment, n1;

/* Create an index on this table to speed up the self-join. */
CREATE INDEX predsumidx ON predsum1 (segment, pn1);

/* Create a cumulative count of the examples of every value of the target per x. */
CREATE TABLE predcum (segment INT, pn1 INT, n01 INT, n11 INT) PCTFREE 0;
INSERT INTO predcum (segment, pn1, n01, n11)
    SELECT a.segment, a.pn1, SUM(b.n0), SUM(b.n1)
    FROM predsum1 a, predsum1 b
    WHERE a.segment = b.segment AND a.pn1 >= b.pn1
    GROUP BY a.segment, a.pn1;

/* Compute the Gini decrease per predictor. */
CREATE TABLE predgini (segment INT, pn1 INT, gini FLOAT) PCTFREE 0;
INSERT INTO predgini (segment, pn1, gini)
    SELECT a.segment, a.pn1,
        2 * (-(n01 * n1) + n0 * n11) * (-(n01 * n1) + n0 * n11) /
        ((n0 + n1) * (n0 + n1) * (n0 - n01 + n1 - n11) * (n01 + n11))
        as gini
    FROM predcum a, targsum b
    WHERE a.segment = b.segment AND (n0 - n01 + n1 - n11) != 0;

/* Find the value of predictor with the maximum decrease. */
SELECT a.segment, a.predictor, b.gini
FROM predsum a, predgini b
WHERE a.segment = b.segment AND
      b.gini = (SELECT MAX(gini)
                FROM predgini c
                WHERE b.segment = c.segment) AND
      a.n1 <= b.pn1
ORDER BY segment;

/* Drop the tables in the reverse order they were created. */
DROP TABLE predgini;
DROP TABLE predcum;
DROP INDEX predsumidx;
DROP TABLE predsum1;
DROP TABLE predsum;
COMMIT;

APPENDIX III - FOR CATEGORICAL PREDICTORS

/* Create a count of the examples of every value of the target per PREDICTOR. */
CREATE TABLE predsum (segment INT, predictor PTYPE, n0 INT, n1 INT) PCTFREE 0;
INSERT INTO predsum (segment, predictor, n0, n1)
    SELECT s.segment, d.PREDICTOR, SUM(2-s.target), SUM(s.target-1)
    FROM DATABASE d, ANALYSIS s
    WHERE d.key = s.key
    GROUP BY s.segment, d.PREDICTOR;

/* Create an index on this table to speed up the self-join. */
CREATE INDEX predsumidx ON predsum (segment, n1);

/* Create a cumulative count of the examples of every value of the target per x. */
/* The SELECT should be useful to reduce high-cardinality categoricals. */
CREATE TABLE predcum (segment INT, pn1 INT, n01 INT, n11 INT) PCTFREE 0;
COMMIT;

SET TRANSACTION USE ROLLBACK SEGMENT dbis_rs;
DECLARE
    CURSOR c1 IS
        SELECT segment, n1, SUM(n0), SUM(n1)
        FROM predsum
        GROUP BY segment, n1
        ORDER BY segment, n1;
    asegment predsum.segment%TYPE;
    apn1 predsum.n1%TYPE;
    an0 predsum.n0%TYPE;
    an1 predsum.n1%TYPE;
    csegment predsum.segment%TYPE := -1;
    cn01 predsum.n0%TYPE;
    cn11 predsum.n1%TYPE;
BEGIN
    OPEN c1;
    LOOP
        FETCH c1 INTO asegment, apn1, an0, an1;
        EXIT WHEN c1%NOTFOUND;
        IF (asegment != csegment) THEN
            csegment := asegment;
            cn01 := an0;
            cn11 := an1;
        ELSE
            cn01 := cn01 + an0;
            cn11 := cn11 + an1;
        END IF;
        INSERT INTO predcum VALUES (asegment, apn1, cn01, cn11);
    END LOOP;
    CLOSE c1;
END;
/

/* Compute the Gini decrease per predictor. */
CREATE TABLE predgini (segment INT, pn1 INT, gini FLOAT) PCTFREE 0;
INSERT INTO predgini (segment, pn1, gini)
    SELECT a.segment, a.pn1,
        2 * (-(n01 * n1) + n0 * n11) * (-(n01 * n1) + n0 * n11) /
        ((n0 + n1) * (n0 + n1) * (n0 - n01 + n1 - n11) * (n01 + n11))
        as gini
    FROM predcum a, targsum b
    WHERE a.segment = b.segment AND (n0 - n01 + n1 - n11) != 0;

/* Find the value of predictor with the maximum decrease. */
SELECT a.segment, a.predictor, b.gini
FROM predsum a, predgini b
WHERE a.segment = b.segment AND
      b.gini = (SELECT MAX(gini)
                FROM predgini c
                WHERE b.segment = c.segment) AND
      a.n1 <= b.pn1
ORDER BY segment;

/* Drop the tables in the reverse order they were created. */
DROP TABLE predgini;
DROP TABLE predcum;
DROP INDEX predsumidx;
DROP TABLE predsum;
COMMIT;

APPENDIX IV - FOR ORDERED PREDICTORS

/* Create a count of the examples of every value of the target per PREDICTOR. */
CREATE TABLE predsum (segment INT, predictor PTYPE, n0 INT, n1 INT) PCTFREE 0;
INSERT INTO predsum (segment, predictor, n0, n1)
    SELECT s.segment, d.PREDICTOR, SUM(2-s.target), SUM(s.target-1)
    FROM DATABASE d, ANALYSIS s
    WHERE d.key = s.key
    GROUP BY s.segment, d.PREDICTOR;

/* Create an index on this table to speed up the self-join. */
CREATE INDEX predsumidx ON predsum (segment, predictor);

/* Create a cumulative count of the examples of every value of the target per x. */
CREATE TABLE predcum (segment INT, predictor PTYPE, n01 INT, n11 INT) PCTFREE 0;
COMMIT;

SET TRANSACTION USE ROLLBACK SEGMENT dbis_rs;
DECLARE
    CURSOR c1 IS
        SELECT segment, predictor, n0, n1
        FROM predsum
        ORDER BY segment, predictor;
    asegment predsum.segment%TYPE;
    apredictor predsum.predictor%TYPE;
    an0 predsum.n0%TYPE;
    an1 predsum.n1%TYPE;
    csegment predsum.segment%TYPE := -1;
    cn01 predsum.n0%TYPE;
    cn11 predsum.n1%TYPE;
BEGIN
    OPEN c1;
    LOOP
        FETCH c1 INTO asegment, apredictor, an0, an1;
        EXIT WHEN c1%NOTFOUND;
        IF (asegment != csegment) THEN
            csegment := asegment;
            cn01 := an0;
            cn11 := an1;
        ELSE
            cn01 := cn01 + an0;
            cn11 := cn11 + an1;
        END IF;
        INSERT INTO predcum VALUES (asegment, apredictor, cn01, cn11);

CREATE TABLE predgini (segment INT, pn1 INT, gini FLOAT) PCTFREE 0;
COMMIT;

SET TRANSACTION USE ROLLBACK SEGMENT dbis_rs;
DECLARE
    /* Pull in a summary of targets by segments and predictors. */

    /* Definitions to hold the reads from the summary. */
    asegment INT;
    apn1 INT;
    an0 INT;
    an1 INT;

    /* Definitions to hold the read from the global summary. */
    tsegment INT;
    tn0 INT;
    tn1 INT;

    /* Definitions for the current segment and current accumulation. */
    csegment INT := -1;
    cn01 INT;
    cn11 INT;
    cgini FLOAT;

    /* The best values found so far. */
    bgini FLOAT;
    bpn1 INT;
BEGIN
    /* Open the input cursors. */
    OPEN c1;
    OPEN c2;

    /* Loop over the pn1 summaries. */
    LOOP
        /* Fetch the next summary. */
        FETCH c1 INTO asegment, apn1, an0, an1;
        EXIT WHEN c1%NOTFOUND;

        /* If we have switched to a different segment. */
        IF (asegment != csegment) THEN
            /* Unless this was the first segment, write the best result out. */
            IF (csegment != -1) THEN
                INSERT INTO predgini VALUES (csegment, bpn1, bgini);
            END IF;

            /* Fetch the next segment's information. */
            FETCH c2 INTO tsegment, tn0, tn1;
            EXIT WHEN c2%NOTFOUND;

            /* Update the segment variable and initialize the cumulative counts. */
            csegment := asegment;
            cn01 := an0;
            cn11 := an1;

            /* Compute the best gini so far for this segment. */
            IF ((tn0 - cn01 + tn1 - cn11) != 0) THEN
                bgini := 2 * (-(cn01 * tn1) + tn0 * cn11) * (-(cn01 * tn1) + tn0 * cn11) /
                    ((tn0 + tn1) * (tn0 + tn1) * (tn0 - cn01 + tn1 - cn11) * (cn01 + cn11));
                bpn1 := apn1;
            END IF;
        ELSE
            /* Update the cumulative counts. */
            cn01 := cn01 + an0;
            cn11 := cn11 + an1;

            IF ((tn0 - cn01 + tn1 - cn11) != 0) THEN
                /* Compute this pn1's gini. */
                cgini := 2 * (-(cn01 * tn1) + tn0 * cn11) * (-(cn01 * tn1) + tn0 * cn11) /
                    ((tn0 + tn1) * (tn0 + tn1) * (tn0 - cn01 + tn1 - cn11) * (cn01 + cn11));

                /* If the best gini so far is worse, replace it with this one. */
                IF (cgini > bgini) THEN
                    bgini := cgini;
                    bpn1 := apn1;
                END IF;
            END IF;
        END IF;
    END LOOP;

    /* Unless this was the first segment, write the best result of the final segment out. */
    IF (csegment != -1) THEN
        INSERT INTO predgini VALUES (csegment, bpn1, bgini);
    END IF;

    /* Close the input cursors. */
    CLOSE c2;
    CLOSE c1;
    COMMIT;
END;
/

/* Find the value of pn1 with the maximum decrease. */
SELECT a.segment, a.predictor, b.gini
FROM predsum a, predgini b
WHERE a.segment = b.segment AND
      b.gini = (SELECT gini
                FROM predgini c
                WHERE b.segment = c.segment) AND
      a.n1 <= b.pn1
ORDER BY segment;

/* Drop the tables in the reverse order they were created. */
DROP TABLE predgini;
DROP TABLE predsum;
COMMIT;

APPENDIX VI - FOR CATEGORICAL PREDICTORS

/* Create a count of the examples of every value of the target per PREDICTOR and */
/* create a cumulative count of the examples of every value of the target per x. */
CREATE TABLE predgini (segment INT, predictor PTYPE, gini FLOAT) PCTFREE 0;
COMMIT;

SET TRANSACTION USE ROLLBACK SEGMENT dbis_rs;
DECLARE
    /* Pull in a summary of targets by segments and predictors. */
    CURSOR c1 IS
        SELECT s.segment, d.PREDICTOR, SUM(2-s.target), SUM(s.target-1)
        FROM DATABASE d, ANALYSIS s
        WHERE d.key = s.key
        GROUP BY s.segment, d.PREDICTOR
        ORDER BY s.segment, d.PREDICTOR;

    /* Pull in a summary of targets by segments alone. */
    CURSOR c2 IS
        SELECT segment, n0, n1
        FROM targsum
        ORDER BY segment;

    /* Definitions to hold the reads from the predictor summary. */
    asegment INT;
    apredictor system.dbmain.PREDICTOR%TYPE;
    an0 INT;
    an1 INT;

    /* Definitions to hold the read from the global summary. */
    tsegment INT;
    tn0 INT;
    tn1 INT;

    /* Definitions for the current segment and current accumulation. */
    csegment INT := -1;
    cn01 INT;
    cn11 INT;
    cgini FLOAT;

    /* The best values found so far. */
    bgini FLOAT;
    bpredictor system.dbmain.PREDICTOR%TYPE;
BEGIN
    /* Open the input cursors. */
    OPEN c1;
    OPEN c2;

    /* Loop over the predictor summaries. */
    LOOP
        /* Fetch the next summary. */
        FETCH c1 INTO asegment, apredictor, an0, an1;
        EXIT WHEN c1%NOTFOUND;

        /* If we have switched to a different segment. */
        IF (asegment != csegment) THEN
            /* Unless this was the first segment, write the best result out. */
            IF (csegment != -1) THEN
                INSERT INTO predgini VALUES (csegment, bpredictor, bgini);
            END IF;

            /* Fetch the next segment's information. */
            FETCH c2 INTO tsegment, tn0, tn1;
            EXIT WHEN c2%NOTFOUND;

            /* Update the segment variable and initialize the cumulative counts. */
            csegment := asegment;
            cn01 := an0;
            cn11 := an1;

            /* Compute the best gini so far for this segment. */
            IF (((tn0 + tn1) * (tn0 - cn01 + tn1 - cn11) * (cn01 + cn11)) != 0) THEN
                bgini := 2 * (-(cn01 * tn1) + tn0 * cn11) * (-(cn01 * tn1) + tn0 * cn11) /
                    ((tn0 + tn1) * (tn0 + tn1) * (tn0 - cn01 + tn1 - cn11) * (cn01 + cn11));
                bpredictor := apredictor;
            END IF;
        ELSE
            /* Update the cumulative counts. */
            cn01 := cn01 + an0;
            cn11 := cn11 + an1;

            IF ((tn0 - cn01 + tn1 - cn11) != 0) THEN
                /* Compute this predictor's gini. */
                cgini := 2 * (-(cn01 * tn1) + tn0 * cn11) * (-(cn01 * tn1) + tn0 * cn11) /
                    ((tn0 + tn1) * (tn0 + tn1) * (tn0 - cn01 + tn1 - cn11) * (cn01 + cn11));

                /* If the best gini so far is worse, replace it with this one. */
                IF (cgini > bgini) THEN
                    bgini := cgini;
                    bpredictor := apredictor;
                END IF;
            END IF;
        END IF;
    END LOOP;

    /* Unless this was the first segment, write the best result of the final segment out. */
    IF (csegment != -1) THEN
        INSERT INTO predgini VALUES (csegment, bpredictor, bgini);
    END IF;

    /* Close the input cursors. */
    CLOSE c2;
    CLOSE c1;
    COMMIT;
END;
/

/* Find the value of predictor with the maximum decrease. */
SELECT *
FROM predgini
ORDER BY segment;

/* Drop the tables in the reverse order they were created. */
DROP TABLE predgini;
COMMIT; 
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APPENDIX VII - QUERIES FOR DIFFERENT DATA TYPES 

Relational databases have both simple and compound data types which are accessed usina 
different queries. In the creation of segments, an initial query joins the target with the predictor 
dam for each node, creating counts of e:ich unique lar<!el value for each unique predictor value. 
The processing tliat follows this step depends only on whether the predictor is considered 
categorical or ordered and is as described above. 

Data Types 

Simple data types include numbers, dates, and strings, while compound data types include sets
and arrays. Simple data types are represented as records that associate a key with a single value,
while compound data types associate a key with multiple values. When the compound data type
is an array, there exist additional fields which serve to index each value.

Queries For Simple Types 

This query creates the summary for both categorical and ordered data: 

CREATE TABLE predsum (segment INT, predictor PTYPE, n0 INT, n1 INT);
INSERT INTO predsum (segment, predictor, n0, n1)
SELECT s.segment, d.predictor, SUM(2-s.target), SUM(s.target-1)
FROM database d, analysis s
WHERE d.key = s.key
GROUP BY s.segment, d.predictor;
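The summary query can be tried on a small in-memory database. The sketch below uses Python's standard `sqlite3` module and assumes that the target is coded 1 or 2, so that `SUM(2 - target)` counts the first class and `SUM(target - 1)` counts the second. The table name `db`, the column name `k`, the `TEXT` predictor type, and the helper name `summarize` are stand-ins chosen for this illustration; the appendix uses `database`, `key`, and `PTYPE`.

```python
import sqlite3


def summarize(db_rows, analysis_rows):
    """Return (segment, predictor, n0, n1) rows for a simple-typed predictor.

    db_rows       -- iterable of (key, predictor)
    analysis_rows -- iterable of (segment, key, target) with target in {1, 2}
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE db (k INTEGER, predictor TEXT)")
    conn.execute(
        "CREATE TABLE analysis (segment INTEGER, k INTEGER, target INTEGER)"
    )
    conn.executemany("INSERT INTO db VALUES (?, ?)", db_rows)
    conn.executemany("INSERT INTO analysis VALUES (?, ?, ?)", analysis_rows)
    rows = conn.execute(
        "SELECT s.segment, d.predictor, SUM(2 - s.target), SUM(s.target - 1) "
        "FROM db d, analysis s "
        "WHERE d.k = s.k "
        "GROUP BY s.segment, d.predictor "
        "ORDER BY s.segment, d.predictor"
    ).fetchall()
    conn.close()
    return rows
```

The `ORDER BY` clause is added only to make the output order deterministic for inspection; it is not part of the query in the appendix.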

Queries For Compound Set Types

This query creates the summary for both categorical and ordered data for a single member of the
set:

CREATE TABLE predsum (segment INT, predictor PTYPE, n0 INT, n1 INT);
INSERT INTO predsum
SELECT s.segment, 0, SUM(2-s.target), SUM(s.target-1)
FROM analysis s
WHERE NOT value IN (SELECT d.predictor FROM database d WHERE d.key = s.key)
GROUP BY segment
UNION
SELECT s.segment, 1, SUM(2-s.target), SUM(s.target-1)
FROM analysis s
WHERE value IN (SELECT d.predictor FROM database d WHERE d.key = s.key)
GROUP BY segment;
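The effect of the two halves of this UNION can be sketched in plain Python: each analysis record is flagged 1 when the chosen value is a member of its set and 0 otherwise, and the target counts are accumulated per (segment, flag). The function name `membership_summary`, the dictionary-of-sets representation of the compound predictor, and the 1/2 coding of the target are assumptions made for this illustration.

```python
from collections import defaultdict


def membership_summary(analysis, members, value):
    """Summarize targets by whether `value` belongs to each record's set.

    analysis -- iterable of (segment, key, target) with target in {1, 2}
    members  -- dict mapping key -> set of predictor values for that key
    value    -- the set member being tested
    Returns sorted (segment, flag, n0, n1) rows, flag 1 meaning "is a member".
    """
    counts = defaultdict(lambda: [0, 0])
    for segment, key, target in analysis:
        flag = 1 if value in members.get(key, set()) else 0
        # target 1 increments n0 (index 0); target 2 increments n1 (index 1).
        counts[(segment, flag)][target - 1] += 1
    return sorted((seg, flag, n[0], n[1]) for (seg, flag), n in counts.items())
```

As in the SQL, a (segment, flag) combination with no matching records produces no row.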

If it is desirable to analyze all elements of the set simultaneously, this pair of queries creates the
summary for both categorical and ordered data:

CREATE TABLE elements (segment INT, element PTYPE);
INSERT INTO elements
SELECT DISTINCT s.segment, d.predictor
FROM analysis s, database d
WHERE s.key = d.key;

CREATE TABLE predsum (segment INT, element PTYPE, predictor INT, n0 INT, n1 INT);
INSERT INTO predsum
SELECT e.segment, e.element, 0, SUM(2-s.target), SUM(s.target-1)
FROM elements e, analysis s
WHERE e.segment = s.segment AND
    NOT e.element IN (SELECT d.predictor FROM database d WHERE d.key = s.key)
GROUP BY e.segment, e.element
UNION
SELECT e.segment, e.element, 1, SUM(2-s.target), SUM(s.target-1)
FROM elements e, analysis s
WHERE e.segment = s.segment AND
    e.element IN (SELECT d.predictor FROM database d WHERE d.key = s.key)
GROUP BY e.segment, e.element;

In this case, the additional field, element, must be carried in each subsequent statement and
included in their GROUP BY and WHERE clauses when appropriate.

Queries For Compound Array Types

This query creates the summary for both categorical and ordered data for a single index of the
array:

CREATE TABLE predsum (segment INT, predictor PTYPE, n0 INT, n1 INT);
INSERT INTO predsum
SELECT s.segment, d.predictor, SUM(2-s.target), SUM(s.target-1)
FROM database d, analysis s
WHERE d.key = s.key AND d.INDEX = value
GROUP BY s.segment, d.predictor;



Add multiple index terms to the WHERE clause for multi-dimensional arrays.

To analyze all values of the index of the array simultaneously, this query creates the summary for
both categorical and ordered data:

CREATE TABLE predsum (segment INT, index ITYPE, predictor PTYPE, n0 INT, n1 INT);
INSERT INTO predsum
SELECT s.segment, d.INDEX, d.predictor, SUM(2-s.target), SUM(s.target-1)
FROM database d, analysis s
WHERE d.key = s.key
GROUP BY s.segment, d.INDEX, d.predictor;

In this case, the additional field, index, must be carried in each subsequent statement and
included in their GROUP BY and WHERE clauses when appropriate. Add multiple index terms
to the summary table and the GROUP BY clause for multi-dimensional arrays.

APPENDIX VIII - CREATE AN ANALYSIS TABLE

/* Sample 10,000,000 keys out of 1,000,000,000. The key */
/* is drawn from the DATABASE and inserted into ANALYSIS. */

CREATE TABLE ANALYSIS (segment INT, key INT, target INT, subsample FLOAT);
DECLARE 

/* The inputs. */
seed CONSTANT BINARY_INTEGER := 1;
population_size BINARY_INTEGER := 1000000000;
sample_size BINARY_INTEGER := 10000000;

/* The source table. */
CURSOR c1 IS SELECT key, target FROM DATABASE;

/* Variables into which to fetch fields. */
akey DATABASE.key%TYPE;
atarget DATABASE.target%TYPE;

/* Constants used by the random number generator. */
IM1 CONSTANT BINARY_INTEGER := 2147483563;
IM2 CONSTANT BINARY_INTEGER := 2147483399;
AM CONSTANT FLOAT := (1.0/IM1);
IMM1 CONSTANT BINARY_INTEGER := (IM1-1);
IA1 CONSTANT BINARY_INTEGER := 40014;
IA2 CONSTANT BINARY_INTEGER := 40692;
IQ1 CONSTANT BINARY_INTEGER := 53668;
IQ2 CONSTANT BINARY_INTEGER := 52774;
IR1 CONSTANT BINARY_INTEGER := 12211;
IR2 CONSTANT BINARY_INTEGER := 3791;
NTAB CONSTANT BINARY_INTEGER := 32;
NDIV CONSTANT FLOAT := (1+FLOOR(IMM1/NTAB));
EPS CONSTANT FLOAT := 1.2e-7;
RNMX CONSTANT FLOAT := (1.0-EPS);

/* The random state. */
TYPE random_iv IS TABLE OF BINARY_INTEGER INDEX BY BINARY_INTEGER;
iv random_iv;
idnum1 BINARY_INTEGER;
idnum2 BINARY_INTEGER;
iy BINARY_INTEGER;

/* Temporary variables for the random. */
j BINARY_INTEGER;
k BINARY_INTEGER;
random_number FLOAT;
comparable FLOAT;
sample_number FLOAT;

/* Temporary variables for the sample. */
tested BINARY_INTEGER;
sampled BINARY_INTEGER;

BEGIN

/* Initialize the random number generator based on the seed. */
/* First, ensure that the random seed is positive. */
IF seed = 0 THEN
    idnum1 := 1;
ELSIF seed < 0 THEN
    idnum1 := -seed;
ELSE
    idnum1 := seed;
END IF;

/* Initialize the second generator. */
idnum2 := idnum1;

/* Perform 8 warm-ups. */
FOR j IN 1..8 LOOP
    k := FLOOR(idnum1 / IQ1);
    idnum1 := FLOOR(IA1 * (idnum1 - k * IQ1) - k * IR1);
    IF idnum1 < 0 THEN
        idnum1 := idnum1 + IM1;
    END IF;
END LOOP;

/* Load the shuffle table. */
FOR j IN 0..(NTAB-1) LOOP
    k := FLOOR(idnum1 / IQ1);
    idnum1 := FLOOR(IA1 * (idnum1 - k * IQ1) - k * IR1);
    IF idnum1 < 0 THEN
        idnum1 := idnum1 + IM1;
    END IF;
    iv(j) := idnum1;
END LOOP;

/* Save the altered idnum1. */
iy := idnum1;






/* Get a random sample. */
sampled := 0;
tested := 0;
OPEN c1;

WHILE (sampled < sample_size) LOOP

    /* Fetch an observation. */
    FETCH c1 INTO akey, atarget;
    EXIT WHEN c1%NOTFOUND;

    /* Compute a random number. */
    /* Compute (IA1 * idnum1) % IM1 without overflows using Schrage's method. */
    k := FLOOR(idnum1 / IQ1);
    idnum1 := FLOOR(IA1 * (idnum1 - k * IQ1) - k * IR1);
    IF idnum1 < 0 THEN
        idnum1 := idnum1 + IM1;
    END IF;

    /* Compute (IA2 * idnum2) % IM2 without overflows using Schrage's method. */
    k := FLOOR(idnum2 / IQ2);
    idnum2 := FLOOR(IA2 * (idnum2 - k * IQ2) - k * IR2);
    IF idnum2 < 0 THEN
        idnum2 := idnum2 + IM2;
    END IF;

    /* j will be in the range of 0..NTAB-1. */
    j := FLOOR(iy / NDIV);

    /* Output the previously stored value combined with idnum2. */
    iy := iv(j) - idnum2;
    IF iy < 1 THEN
        iy := iy + IMM1;
    END IF;

    /* Refill the shuffle table. */
    iv(j) := idnum1;

    /* Do not use end-point values. */
    random_number := AM * iy;
    IF random_number > RNMX THEN
        random_number := RNMX;
    END IF;



    /* Compute a sample number. */
    /* Compute (IA1 * idnum1) % IM1 without overflows using Schrage's method. */
    k := FLOOR(idnum1 / IQ1);
    idnum1 := FLOOR(IA1 * (idnum1 - k * IQ1) - k * IR1);
    IF idnum1 < 0 THEN
        idnum1 := idnum1 + IM1;
    END IF;

    /* Compute (IA2 * idnum2) % IM2 without overflows using Schrage's method. */
    k := FLOOR(idnum2 / IQ2);
    idnum2 := FLOOR(IA2 * (idnum2 - k * IQ2) - k * IR2);
    IF idnum2 < 0 THEN
        idnum2 := idnum2 + IM2;
    END IF;

    /* j will be in the range of 0..NTAB-1. */
    j := FLOOR(iy / NDIV);

    /* Output the previously stored value combined with idnum2. */
    iy := iv(j) - idnum2;
    IF iy < 1 THEN
        iy := iy + IMM1;
    END IF;

    /* Refill the shuffle table. */
    iv(j) := idnum1;

    /* Do not use end-point values. */
    sample_number := AM * iy;
    IF sample_number > RNMX THEN
        sample_number := RNMX;
    END IF;

40 

    /* Test the random number to see if this column is to be inserted into the sample. */
    comparable := (sample_size - sampled) / (population_size - tested);
    IF random_number < comparable THEN

        /* Increment the count of samples obtained. */
        sampled := sampled + 1;

        /* Insert the sample into the analysis table. */
        INSERT INTO ANALYSIS VALUES (1, akey, atarget, sample_number);
    END IF;

    /* Increment the count of observations tested. */
    tested := tested + 1;
END LOOP;

END;
/

COMMIT;
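Two ideas in this appendix can be checked in isolation: Schrage's method for computing (a * x) mod m without overflowing a 32-bit integer, and the sequential test `(sample_size - sampled) / (population_size - tested)` that selects records in a single pass. The Python sketch below is illustrative only. The function names `schrage_step` and `selection_sample` are hypothetical, and `random.Random` is swapped in for the shuffled two-generator scheme of the appendix; only the constants of the first generator are taken from the listing.

```python
import random

# Constants of the first generator, as listed in the appendix.
IM1, IA1, IQ1, IR1 = 2147483563, 40014, 53668, 12211


def schrage_step(x, a=IA1, m=IM1, q=IQ1, r=IR1):
    """Return (a * x) % m for 0 < x < m using Schrage's method.

    Requires m == a * q + r with r < q, so every intermediate value stays
    below m in magnitude and fits in a signed 32-bit integer.
    """
    k = x // q
    x = a * (x - k * q) - k * r
    return x + m if x < 0 else x


def selection_sample(records, sample_size, rng):
    """Pick sample_size records in one pass, preserving their order.

    Each record is accepted with probability
    (still needed) / (still unseen), as in the appendix loop.
    """
    population_size = len(records)
    sampled = tested = 0
    chosen = []
    for record in records:
        if sampled >= sample_size:
            break
        threshold = (sample_size - sampled) / (population_size - tested)
        if rng.random() < threshold:
            chosen.append(record)
            sampled += 1
        tested += 1
    return chosen
```

Because the acceptance threshold reaches 1 once the number of records still needed equals the number still unseen, the pass returns exactly `sample_size` records whenever the population is at least that large.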



Claims 

1. A computer system for classifying records in a computer database, comprising:

means for creating a table in the computer database for indicating, for each entity in a sample of the computer
database, a segment in which the entity is placed; and
means for classifying the entity into segments according to a selection criterion and for modifying the analysis
table according to the generated classification.

2. The computer system of claim 1, wherein the means for classifying includes means for identifying suitable selection
criteria to maximize a probability that an entity in a segment defined by the selection criteria meets a target
characteristic.

3. The computer system of claim 1, wherein said table includes a field containing a random number enabling random
sub-sampling of the records in the table.

4. The computer system of claim 3, wherein said random number is tested against a threshold, adjustment of which
serves to control the sample size of the sub-sampling operation.